Over the decades, linguists have shown that both language form and language use in context vary across people with different backgrounds. In this post, I analyze HappyDB to explore this variation in written text. HappyDB is a corpus of 100,000 crowd-sourced happy moments that people shared from either the past three months or the past 24 hours (https://rit-public.github.io/HappyDB/).
In this study, I investigate how happy moments vary among people from different countries, of different ages, and with different marital statuses.
Length and topics are two key properties of a happy moment, so I analyze these two aspects with exploratory data analysis (boxplots, line plots, and scatterplots), word clouds, bigrams, and topic modeling.
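As a quick illustration of one of these tools, here is a minimal bigram-extraction sketch with tidytext; the two sentences are hypothetical, but the same call applies to the processed happy moments built in the rest of this post.
library(tidytext)
library(tibble)
# Toy data (made up): split each moment into consecutive word pairs
toy <- tibble(id = 1:2,
              text = c("i watched a movie with my family",
                       "my family cooked dinner together"))
toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)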
Note: The text processing below is based on Professors Ying Liu, Arpita Shah, and Tian Zheng's tutorial.
library(tm)
library(tidytext)
library(tidyverse)
library(DT)
library(scales)
library(wordcloud2)
library(gridExtra)
library(ngram)
library(shiny)
# Load the cleaned happy moments from the HappyDB repository
urlfile <- 'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/cleaned_hm.csv'
hm_data <- read_csv(urlfile)
We clean the text by converting all letters to lower case and removing punctuation, numbers, empty words, and extra white space.
corpus <- VCorpus(VectorSource(hm_data$cleaned_hm)) %>%
  tm_map(content_transformer(tolower)) %>%  # lower-case all letters
  tm_map(removePunctuation) %>%             # drop punctuation
  tm_map(removeNumbers) %>%                 # drop numbers
  tm_map(removeWords, character(0)) %>%     # drop empty words
  tm_map(stripWhitespace)                   # collapse extra white space
Stemming reduces a word to its word stem. We stem the words here and then convert the “tm” object to a “tidy” object for much faster processing.
stemmed <- tm_map(corpus, stemDocument) %>%
  tidy() %>%
  select(text)
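tm's stemDocument is built on the Snowball stemmer, so we can preview its behavior by calling SnowballC directly (the example words are illustrative):
SnowballC::wordStem(c("happiness", "families", "running"))
# returns the stems "happi", "famili", "run"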
We also need a dictionary to look up the words corresponding to the stems.
dict <- tidy(corpus) %>%
  select(text) %>%
  unnest_tokens(dictionary, text)
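The dictionary holds one original (unstemmed) word per row, in corpus order, which is what lets us line it up with the stems later:
head(dict)   # one unstemmed word per row
nrow(dict)   # total number of tokens in the corpus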
We remove stopwords provided by the “tidytext” package and also add custom stopwords suited to the context of our data.
data("stop_words")
word <- c("happy","ago","yesterday","lot","today","months","month",
"happier","happiest","last","week","past")
stop_words <- stop_words %>%
bind_rows(mutate(tibble(word), lexicon = "updated"))
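A quick check confirms the custom words were appended under their own lexicon:
filter(stop_words, lexicon == "updated")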
Here we combine the stems and the dictionary into the same “tidy” object.
completed <- stemmed %>%
  mutate(id = row_number()) %>%
  unnest_tokens(stems, text) %>%
  bind_cols(dict) %>%
  anti_join(stop_words, by = c("dictionary" = "word"))
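Note that bind_cols() assumes the stems align one-to-one, in order, with the dictionary words; this holds because stemming replaces each token in place without changing the token count. A quick sanity check (a sketch, using the objects already in scope):
n_stems <- stemmed %>%
  unnest_tokens(stems, text) %>%
  nrow()
stopifnot(n_stems == nrow(dict))  # stems and dictionary words align row by row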
Lastly, we complete the stems by picking the corresponding word with the highest frequency.
completed <- completed %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%
  ungroup() %>%
  select(stems, word) %>%
  distinct() %>%
  right_join(completed, by = "stems") %>%
  select(-stems)
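To make the which.max() step concrete, here is a toy example with hypothetical counts: every occurrence of the stem "famili" gets completed with its most frequent surface form.
toy <- tibble(stems = c("famili", "famili", "famili"),
              dictionary = c("family", "family", "families"))
toy %>%
  group_by(stems) %>%
  count(dictionary) %>%
  mutate(word = dictionary[which.max(n)]) %>%  # "family" wins with n = 2
  ungroup()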
We want the processed words to resemble the structure of the original happy moments, so we paste the words back together to reassemble each happy moment.
completed <- completed %>%
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " ")) %>%
  ungroup()
hm_data <- hm_data %>%
  mutate(id = row_number()) %>%
  inner_join(completed, by = "id")
datatable(hm_data)
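With the processed text in place, the analyses previewed at the top follow directly. As a starting point, here is a minimal sketch of both aspects: moment length in words (via ngram::wordcount) and overall word frequencies for a word cloud (wordcloud2 reads the first column as the word and the second as the frequency).
# Length of each happy moment, in words
hm_data <- hm_data %>%
  mutate(count = sapply(text, wordcount))

# Overall word frequencies, rendered as a word cloud
word_freq <- hm_data %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
wordcloud2(word_freq)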